Homework 2 is due by noon, Tues 9/28. Please complete the assignment in this Markdown document, filling in your answers and R code below. I didn’t create answer and R chunk fields like I did with homework 1, but please fill in your answers and R code in the same manner as hw 1. Make sure to follow the homework guidelines when writing up this assignment (handout is located on the right side of moodle page).
Tips for using Markdown with homework sets:
Work through a problem by putting your R code into R chunks in this .Rmd. Run the R code to make sure it works, then knit the .Rmd to verify they work in that environment.
Make sure you load your data in the .Rmd and include any needed library commands.
Feel free to edit or delete questions, instructions, or code provided in this file when producing your homework solution.
For your final document, you can change the output type from html_document to word_document or pdf_document. These two to output types are better formatted for printing.
Keep the hw problems in problem order I give in this doc. You can attach hand-written work for a problem (if needed) but make it clear in this doc when you are answering a problem using work attached to your printed pdf/word doc.
Comment: Complete this problem “by hand” using the info in Display 7.9 (i.e. don’t load the data and fit a lm). Use the qt command in R to get your mutliplier \(t^*\) for the CI calculation.
Answer: As shown below, we have evidence that the effect of measured recession velocity on average measured distance is statistically significant (t = 2.073873, d.f. = 22, p = 0.0000045). We are 95% confident that a 1 km/s increase in recession velocity is associated with a 0.0008672125 to 0.0018787875 km increase in measured distance.
qt(.975,22)
## [1] 2.073873
0.001373 + c(-1,1)*qt(.975,10)*0.000227
## [1] 0.0008672125 0.0018787875
Recall the agstrat.csv data used for homework 1. This was a stratified random sample of US counties. We will consider the regression of farm acreage in 1992 (y=acres92) on farm acreage in 1982 (x=acres82).
ggplot2 package to create a scatterplot of acres92 against acres82 that includes the fitted regression line. Describe the direction, form and strenth of the relationship shown in this graph.answer: Form: Mostly linear with no visible curvature observed, Direction: Positive (as one increases the other increases as well) Strength: Strong, not much variation.
library(ggplot2)
agstrat <- read.csv("http://people.carleton.edu/~kstclair/data/agstrat.csv")
ggplot(agstrat, aes(x = acres82, y = acres92)) +
geom_point() +
geom_smooth(se = FALSE)+
labs(title = "Acres82 vs. Acres92")
acres92 on acres82. Interpret the slope in context for this problem.agstrat_lm <- lm(acres92 ~ acres82, data = agstrat)
summary(agstrat_lm)
##
## Call:
## lm(formula = acres92 ~ acres82, data = agstrat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -249540 -12517 -2286 7354 261600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.532e+03 3.202e+03 -1.728 0.085 .
## acres82 9.844e-01 7.035e-03 139.923 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41050 on 298 degrees of freedom
## Multiple R-squared: 0.985, Adjusted R-squared: 0.985
## F-statistic: 1.958e+04 on 1 and 298 DF, p-value: < 2.2e-16
ggplot(agstrat, aes(x = acres82, y = acres92)) +
geom_point() +
geom_smooth(se = FALSE)+
labs(title = "Acres82 vs. Acres92")
answer: \(\mu_{acres82|acres92} = -5532 + 0.9844(acres92), \hat \sigma = 41050\). According to this fitted model, for every 1 acreage increase in farm acreage in 1982 is associated with a 0.9844 increase in farm acreage in 1992.
answer: \(t-stat = \frac{0.9844 - 0}{0.007035} = 139.9289, p-value = 2P(T > 139.9289) = 2e-16\). Since we have a p-value that is significantly less than 0.05 (p-value = 2e-16). This means we have evidence that the estimated effect of farm acreage in 1982 on farm acreage in 1992 is statistically significant.
answer: Linearity: Looking at the scatter plot we can see that the mean response does indeed vary linearly with \(x\). Constant Variance: We can also see that the model errors are not associated with \(x\). Normality: From the histogram we can see that the \(\mu_{y|x}\) can be described by a normally distributed model. Independence: Since these values were measured over a decade, we can say that they are temporally (time) correlated and are therefore independent
library(ggplot2)
agstrat <- read.csv("http://people.carleton.edu/~kstclair/data/agstrat.csv")
agstrat_lm <- lm(acres92 ~ acres82, data = agstrat)
library(broom)
library(ggResidpanel)
agstrat_aug <- augment(agstrat_lm)
head(agstrat_aug)
ggplot(agstrat, aes(x = acres82, y = acres92)) +
geom_point() +
geom_smooth(se = FALSE)+
labs(title = "Acres82 vs. Acres92")
resid_xpanel(agstrat_lm)
hist(agstrat_aug$.resid)
plot(agstrat_lm, which = 2)
resid_panel(agstrat_lm, plots = "qq")
resid_interact(agstrat_lm, plots = "qq")
answer: 17.a) \(\mu_{pH|Time} = 6.98363 - 0.72566(acres92), \hat \sigma = 0.08226\); 17.b) standard error of the estimated mean is 0.0297; 18. SE prediction = 0.0875; CI = (5.6172,6.0208)
case0702 which is in the Sleuth3 package. Your R chunk should look likelibrary(Sleuth3)
head(case0702)
meat_lm <- lm(pH ~ log(Time), data=case0702)
predict(meat_lm,
newdata = data.frame(Time = 5),
interval = "confidence",
se.fit = T)
## $fit
## fit lwr upr
## 1 5.815725 5.747122 5.884328
##
## $se.fit
## [1] 0.02974967
##
## $df
## [1] 8
##
## $residual.scale
## [1] 0.08225969
\(0.0823 * sqrt(1 + 1/10 + (1.609 - 1.190)^2/9 * 0.6344 = 0.0875\) \(6.98 - 0.726*log(5) = 5.819\) \((5.819 +/- (2.306 * 0.0875) = (5.6172,6.0208)\)
log(TIME) as the predictor in your model: lm(pH ~ log(Time), data=case0702)predict command in R. Important note: If you log a predictor in your lm command, e.g. lm(y ~ log(x)), then you give the predict command the value of x on the original (unlogged) scale when entering a value for newdata.Sleuth3 package (ex0817).ggplot2 package let’s you easily do this without applying those functions to your variables, instead you add another layer that tells R how to scale a particular axis. This method of visualization is nice because your numerical labels on the x/y axis will still be measured in the original units of the variables. If you want to, say, look at the scatterplot of sqrt(y) against log10(x) (base-10 log) you would add the layers scale_y_sqrt() and scale_x_log10() to your scatterplot of y against x. For this example, that would look like:answer \(Y = 10.796925 - 0.012135x\)
library(Sleuth3)
head(ex0817)
library(ggplot2)
ggplot(ex0817, aes(x= Load, y = Mass)) +
geom_point() +
scale_y_sqrt() +
scale_x_log10()
ex0817_lm <- lm(Mass ~ Load, data = ex0817)
summary(ex0817_lm)
fitted(ex0817_lm)
resid(ex0817_lm)
ex0327scale layers described in Problem 4 hints.filter command from dplyr to make a version of the data set that only contains cases where DurationOfVisit < 31answer: a.) The residual plot is skewed to the right; b.) The log transformation do not help; c.) The new plot filtered with duration < 31 fits much better as shown by the graph
library(Sleuth3)
library(dplyr)
head(ex0327)
ex0327_lm <- lm(PollenRemoved ~ DurationOfVisit, data = ex0327)
library(broom)
ex0327_aug <- augment(ex0327_lm)
head(ex0327_aug)
library(ggplot2)
ggplot(ex0327_aug, aes(x = DurationOfVisit, y = PollenRemoved)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(y = "residuals", title = "Residual plot")
library(ggResidpanel)
resid_xpanel(ex0327_lm)
new_Time <- filter(ex0327,DurationOfVisit < 31)
ggplot(new_Time, aes(x = DurationOfVisit, y = PollenRemoved)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(y = "residuals", title = "Residual plot")
The (hidden) R chunk below defines a function named reg.sim2 that samples responses from a SLR model given a vector of explanatory values \(x\). You will use the following SLR model to generate your set of responses: \[
Y_{i} = 20 + 1x_{i} + \epsilon_{i} \ \ \ \ \epsilon_{i} \sim N(0,2)
\] so that \(\beta_{0}=20, \beta_{1}=1\) and \(\sigma=2\).
We start by generating a sample of 1000 explanatory variable values. Suppose the explanatory variable \(x\) is equally (uniformly) distribution between 1 and 10. Generate \(x\) and view its distribution (use whatever seed value you like):
set.seed(7)
x <- seq(from=1,to=10,length=1000)
hist(x)
Next, for each of the 1000 \(x_i\)’s that you just created, use the reg.sim2 function to generate 1000 responses \(y_i\) from the population model described at the start of this problem:
reg.sim2(x, beta0=20, beta1=1, sigma=2, grph=T)
## $b0
## [1] 20.23249
##
## $b1
## [1] 0.9588374
##
## $SE.b0
## [1] 0.1451941
##
## $SE.b1
## [1] 0.0238654
The R output gives the estimated slope and intercept (\(\hat{\beta}_{1}, \hat{\beta}_{0}\)), a scatterplot of the data (along with the sample and population regression lines), and histograms/QQnormal plots for the responses \(y_i\) and the residuals \(r_{i}\). Use the histograms and QQ normal plots to answer the following questions:
Are the sampled \(y\)s normally distributed? If not, describe the general shape of their distribution.
answer Yes, looking at the graph we can that the samples \(y\) are normally distributed.
Are the residuals normally distributed? If not, describe the general shape of their distribution.
answer Yes, looking at the graph we can that the residuals are normally distributed.
rgamma(1000, 1, 1/2).answer Looking at the graph we can that the samples \(y\) are normally distributed but slightly skewed to the right
answer Looking at the graph we can that the residuals are normally distributed.
set.seed(1)
x <- rgamma(1000, 1, 1/2)
hist(x)
reg.sim2(x, beta0=20, beta1=1, sigma=2, grph=T)
## $b0
## [1] 20.05299
##
## $b1
## [1] 1.010284
##
## $SE.b0
## [1] 0.09005439
##
## $SE.b1
## [1] 0.03204216
rnorm(1000, 10, 2).answer Looking at the graph we can that the samples \(y\) are normally distributed but slightly skewed to the left
answer Looking at the graph we can that the residuals are normally distributed.
set.seed(7)
x <- rnorm(1000, 10, 2)
hist(x)
reg.sim2(x, beta0=20, beta1=1, sigma=2, grph=T)
## $b0
## [1] 19.75534
##
## $b1
## [1] 1.028176
##
## $SE.b0
## [1] 0.3359167
##
## $SE.b1
## [1] 0.03294284
answer Depending on how the distribution of the explanatory variable \(x\), the distribution of the response can be skewed to the left or to the right.